{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "    <p style=\"text-align:center\">\n",
    "        <img alt=\"phoenix logo\" src=\"https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg\" width=\"200\"/>\n",
    "        <br>\n",
    "        <a href=\"https://arize.com/docs/phoenix/\">Docs</a>\n",
    "        |\n",
    "        <a href=\"https://github.com/Arize-ai/phoenix\">GitHub</a>\n",
    "        |\n",
    "        <a href=\"https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email\">Community</a>\n",
    "    </p>\n",
    "</center>\n",
    "<h1 align=\"center\">Tracing and Evaluating a LangChain Application</h1>\n",
    "\n",
    "LangChain provides high-level APIs that enable users to build powerful applications in a few lines of code. However, it can be challenging to understand what is going on under the hood and to pinpoint the cause of issues. Phoenix makes your LLM applications *observable* by visualizing the underlying structure of each call to your query engine and surfacing problematic \"spans\" of execution based on latency, token count, or other evaluation metrics.\n",
    "\n",
    "In this tutorial, you will:\n",
    "- Build a simple question and answer app using LangChain that uses retrieval-augmented generation to answer questions over the Arize documentation,\n",
    "- Record trace data in OpenInference format,\n",
    "- Inspect the traces and spans of your application to identify sources of latency and cost,\n",
    "- Export your trace data as a pandas dataframe and run an LLM-assisted evaluation to measure the precision@k of your retrieval step.\n",
    "\n",
    "ℹ️ This notebook requires an OpenAI API key."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Install Dependencies and Import Libraries\n",
    "\n",
    "Install Phoenix, LangChain, and OpenAI."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install \"langchain>=0.1.0\" langchain-community langchain-openai \"openai>=1\" 'httpx<0.28' \"arize-phoenix[evals]\" tiktoken nest-asyncio openinference-instrumentation-langchain"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Import libraries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import os\n",
    "from getpass import getpass\n",
    "from urllib.request import urlopen\n",
    "\n",
    "import nest_asyncio\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from langchain.chains import RetrievalQA\n",
    "from langchain.retrievers import KNNRetriever\n",
    "from langchain_openai import ChatOpenAI, OpenAIEmbeddings\n",
    "from openinference.instrumentation.langchain import LangChainInstrumentor\n",
    "from tqdm import tqdm\n",
    "\n",
    "import phoenix as px\n",
    "from phoenix.evals import (\n",
    "    HallucinationEvaluator,\n",
    "    OpenAIModel,\n",
    "    QAEvaluator,\n",
    "    RelevanceEvaluator,\n",
    "    run_evals,\n",
    ")\n",
    "from phoenix.otel import register\n",
    "from phoenix.session.evaluation import get_qa_with_reference, get_retrieved_documents\n",
    "from phoenix.trace import DocumentEvaluations\n",
    "\n",
    "nest_asyncio.apply()  # needed for concurrent evals in notebook environments"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Launch Phoenix\n",
    "\n",
    "You can run Phoenix in the background to collect trace data emitted by any LangChain application that has been instrumented with the `OpenInferenceTracer`.\n",
    "\n",
    "Launch Phoenix and follow the instructions in the cell output to open the Phoenix UI (the UI should be empty because we have yet to run a LangChain application)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "(session := px.launch_app()).view()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Configure Your OpenAI API Key\n",
    "\n",
    "Set your OpenAI API key if it is not already set as an environment variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "if os.environ.get(\"OPENAI_API_KEY\") is None:\n",
    "    openai_api_key = getpass(\"🔑 Enter your OpenAI API key: \")\n",
    "    os.environ[\"OPENAI_API_KEY\"] = openai_api_key"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4. Build Your LangChain Application\n",
    "\n",
    "This example uses a `RetrievalQA` chain over a pre-built index of the Arize documentation, but you can use whatever LangChain application you like.\n",
    "\n",
    "Download your pre-built index from cloud storage and instantiate your storage context."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.read_parquet(\n",
    "    \"http://storage.googleapis.com/arize-phoenix-assets/datasets/\"\n",
    "    \"unstructured/llm/context-retrieval/langchain/database.parquet\"\n",
    ")\n",
    "knn_retriever = KNNRetriever(\n",
    "    index=np.stack(df[\"text_vector\"]),\n",
    "    texts=df[\"text\"].tolist(),\n",
    "    embeddings=OpenAIEmbeddings(),\n",
    ")\n",
    "chain_type = \"stuff\"  # stuff, refine, map_reduce, and map_rerank\n",
    "chat_model_name = \"gpt-3.5-turbo\"\n",
    "llm = ChatOpenAI(model_name=chat_model_name)\n",
    "chain = RetrievalQA.from_chain_type(\n",
    "    llm=llm,\n",
    "    chain_type=chain_type,\n",
    "    retriever=knn_retriever,\n",
    "    metadata={\"application_type\": \"question_answering\"},\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Instrument LangChain"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tracer_provider = register()\n",
    "LangChainInstrumentor(tracer_provider=tracer_provider).instrument(skip_dep_check=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Run Your Query Engine and View Your Traces in Phoenix\n",
    "\n",
    "Download a sample of queries commonly asked of the Arize documentation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "url = \"http://storage.googleapis.com/arize-phoenix-assets/datasets/unstructured/llm/context-retrieval/arize_docs_queries.jsonl\"\n",
    "queries = []\n",
    "with urlopen(url) as response:\n",
    "    for line in response:\n",
    "        line = line.decode(\"utf-8\").strip()\n",
    "        data = json.loads(line)\n",
    "        queries.append(data[\"query\"])\n",
    "queries[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Run a few queries."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for query in tqdm(queries[:10]):\n",
    "    chain.invoke(query)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Check the Phoenix UI as your queries run. Your traces should appear in real time."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Export and Evaluate Your Trace Data\n",
    "\n",
    "You can export your trace data as a pandas dataframe for further analysis and evaluation.\n",
    "\n",
    "In this case, we will export our `retriever` spans into two separate dataframes:\n",
    "- `queries_df`, in which the retrieved documents for each query are concatenated into a single column,\n",
    "- `retrieved_documents_df`, in which each retrieved document is \"exploded\" into its own row to enable the evaluation of each query-document pair in isolation.\n",
    "\n",
    "This will enable us to compute multiple kinds of evaluations, including:\n",
    "- relevance: Are the retrieved documents grounded in the response?\n",
    "- Q&A correctness: Are your application's responses grounded in the retrieved context?\n",
    "- hallucinations: Is your application making up false information?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "queries_df = get_qa_with_reference(px.Client())\n",
    "retrieved_documents_df = get_retrieved_documents(px.Client())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, define your evaluation model and your evaluators.\n",
    "\n",
    "Evaluators are built on top of language models and prompt the LLM to assess the quality of responses, the relevance of retrieved documents, etc., and provide a quality signal even in the absence of human-labeled data. Pick an evaluator type and instantiate it with the language model you want to use to perform evaluations using our battle-tested evaluation templates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "eval_model = OpenAIModel(\n",
    "    model=\"gpt-4-turbo-preview\",\n",
    ")\n",
    "hallucination_evaluator = HallucinationEvaluator(eval_model)\n",
    "qa_correctness_evaluator = QAEvaluator(eval_model)\n",
    "relevance_evaluator = RelevanceEvaluator(eval_model)\n",
    "\n",
    "hallucination_eval_df, qa_correctness_eval_df = run_evals(\n",
    "    dataframe=queries_df,\n",
    "    evaluators=[hallucination_evaluator, qa_correctness_evaluator],\n",
    "    provide_explanation=True,\n",
    ")\n",
    "relevance_eval_df = run_evals(\n",
    "    dataframe=retrieved_documents_df,\n",
    "    evaluators=[relevance_evaluator],\n",
    "    provide_explanation=True,\n",
    ")[0]\n",
    "\n",
    "from phoenix.client import AsyncClient\n",
    "\n",
    "px_client = AsyncClient()\n",
    "await px_client.spans.log_span_annotations_dataframe(\n",
    "    dataframe=hallucination_eval_df,\n",
    "    annotation_name=\"Hallucination\",\n",
    "    annotator_kind=\"LLM\",\n",
    ")\n",
    "await px_client.spans.log_span_annotations_dataframe(\n",
    "    dataframe=qa_correctness_eval_df,\n",
    "    annotation_name=\"QA Correctness\",\n",
    "    annotator_kind=\"LLM\",\n",
    ")\n",
    "\n",
    "px.Client().log_evaluations(\n",
    "    DocumentEvaluations(eval_name=\"Relevance\", dataframe=relevance_eval_df),\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Your evaluations should now appear as annotations on the appropriate spans in Phoenix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"🚀 Open the Phoenix UI if you haven't already: {session.url}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Final Thoughts\n",
    "\n",
    "LLM Traces and the accompanying OpenInference Tracing specification is designed to be a category of telemetry data that is used to understand the execution of LLMs and the surrounding application context such as retrieval from vector stores and the usage of external tools such as search engines or APIs. It lets you understand the inner workings of the individual steps your application takes wile also giving you visibility into how your system is running and performing as a whole.\n",
    "\n",
    "LLM Evals are designed for simple, fast, and accurate LLM-based evaluations. They let you quickly benchmark the performance of your LLM application and help you identify the problematic spans of execution.\n",
    "\n",
    "For more details on Phoenix, LLM Tracing, and LLM Evals, checkout our [documentation](https://arize.com/docs/phoenix/)."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
